Decomposition aided attention-based recurrent neural networks for multistep ahead time-series forecasting of renewable power generation

Renewable energy plays an increasingly important role in our future. As fossil fuels become more difficult to extract and effectively process, renewables offer a solution to the ever-increasing energy demands of the world. However, the shift toward renewable energy is not without challenges. While fossil fuels offer a more reliable means of energy storage that can be converted into usable energy, renewables are more dependent on external factors used for generation. Efficient storage of renewables is more difficult often relying on batteries that have a limited number of charge cycles. A robust and efficient system for forecasting power generation from renewable sources can help alleviate some of the difficulties associated with the transition toward renewable energy. Therefore, this study proposes an attention-based recurrent neural network approach for forecasting power generated from renewable sources. To help networks make more accurate forecasts, decomposition techniques utilized applied the time series, and a modified metaheuristic is introduced to optimized hyperparameter values of the utilized networks. This approach has been tested on two real-world renewable energy datasets covering both solar and wind farms. The models generated by the introduced metaheuristics were compared with those produced by other state-of-the-art optimizers in terms of standard regression metrics and statistical analysis. Finally, the best-performing model was interpreted using SHapley Additive exPlanations.


INTRODUCTION
The role of renewable energy is a paramount factor in sustainability of the society.Traditional energy systems based on fossil fuels are not efficient and require more complicated processes of extraction.The demands of human civilization are always growing, which exposes the difficulties for eco-friendly energetic growth.As renewable energy source (RES) become more available the distribution of new resources in the network result in stochasticity, intermittency, and uncertainty.Consequentially, the traditional energy systems are dominant in the share of energy used amounting to 81% of the global share (Loe, 2022).
For RES to become more widely utilized, the previously mentioned challenges need to be overcome.Additionally, energy storage on a smaller scale remains difficult when working with RES, in comparison to fossil fuel storage which is still considered more reliable.The storage of electricity is mostly achieved by batteries which are a limited resource on their own due to the limited number of life cycles for each one of them (Zhang & Zhao, 2023).All things considered, a possible solution is a mechanism that can provide accurate forecasts of the amount of resources being generated from RES.Such a solution would have to be able to analyze short-term time series and provide a robust mechanism as it affects electricity load and its price.Electricity traders and system operators are most affected by these changes.
Traditional methods for regression have previously been applied to forecasting RES power production (Foley et al., 2012;Abuella & Chowdhury, 2015) However, as the world's need for energy increases further improvements are needed in order to make forecasting methods viable.A major challenge when tackling RES production forecasting comes from the noisy nature of the data.Since renewable resources rely on natural phenomena such as wind or solar exposure, many chaotic factors play a role in the amount of power that can be produced.Nevertheless, patterns in this data are still present, though often difficult to initially observe.
By applying advanced signal processing techniques, such as decomposition techniques, strong signals can be separated from the noise, allowing prediction methods to focus on determining correlations between signals with strong patterns rather than those heavily affected by the noise.This concept has often been applied to systems that require precise moments in noise environments such as electroencephalography (Murariu, Dorobanţu & Tărniceriu, 2023) demonstrating great potential.Several decomposition techniques have been developed in recently such as empirical mode decomposition (EMD) (Boudraa & Cexus, 2007) and ensemble empirical mode decomposition (EEMD) (Wu & Huang, 2009).While efficient, the lack of a strong mathematical background in these methods has led to the development of variational mode decomposition (VMD) (Dragomiretskiy & Zosso, 2013) that has shown great potential for tackling signal decomposition with a strong mathematical basis (Liu et al., 2022;Zhang, Peng & Nazir, 2022;Gao et al., 2022).
One additional approach that has shown great potential when working with data catheterized by complex nonlinear relations is the application of artificial intelligence (AI).Powerful AI algorithms are capable of improving their performance through an iterative data-driven process.By observing data AI algorithms can determine correlations without explicit programming.This makes AI a promising approach for tackling this pressing issue.Nevertheless, the modern algorithms' performance is reliant on proper hyperparameter selection.With increasing numbers of hyperparameters, traditional methods such as trial and error have become insufficient to optimize algorithm performance.The use of metaheuristic optimization algorithms provides a potential solution for efficient hyperparameter selection.
Forecasting power generation is regarded as a time series forecasting challenge.By doing so, algorithms capable of responding to data sequences can be leveraged in order to make more accurate forecasts.One promising approach, that extensive literature review suggests has not yet sufficiently been explored when applied to renewable forecasting, is the use of recurrent neural networks (RNN) (Medsker & Jain, 1999).These networks represent a variety of artificial neural networks (ANN) that allow previous inputs to affect future outputs, making them highly suitable for time series forecasting.A recent improvement incorporates attention mechanisms (Olah & Carter, 2016) into RNN allowing networks to focus their attention on specific features improving accuracy.Additionally, the literature review suggests that attention-based RNNs (RNN- ATT) have not yet been applied to renewable power forecasting, indicating a gap in research that this work hopes to address.Exploring the potential of these networks is essential as a robust forecasting method could help make RES more viable and lower the world's dependence on fossil fuels.
This research proposes an approach that applies a neural network model based on attention for that purpose.Moreover, the proposed model was applied to two different problems including the Spain wind and solar energy predictions and the wind farms in China predictions.Datasets for both countries' surveys have been used with the RNN model and the attention-based recurrent neural network RNN-ATT.However, these networks require fine-tuning of a large number of hyperparameters, that can result in nondeterministic polynomial time complexity (NP-hard).Hyperparameter optimization is done through the use of metaheuristics, and a modified version of the well-known Harris hawk optimization (HHO) (Heidari et al., 2019) algorithm is introduced.Two sets of experiments have been carried out both with RNN and RNN-ATT networks, applied to each real-world dataset.
This research is an extension of previous researches in this domain (Bacanin et al., 2023c;Stoean et al., 2023;Bacanin et al., 2023b), where the long short-term (LSTM), bidrectional LSTM (BiLSTM) and gated recurrent unit (GRU) were applied for RES forecasting challenges.However, the goal of this work is to test lighter models (classical RNNs) for problems of RES with the application of fewer neurons over layers while providing satisfactory performance.Additionally, conversely to previous experimentation, current research also investigates the potential of RNNs with attention mechanism and it was validated against different RES time-series datasets.Also, the classical RNNs (without attention mechanism) were also validated in order to establish the influence of attention layer to overall network performance.
The primary contributions of this work can be summarized as the following: The RNN-ATT-based method for forecasting RES power generation.
A modified version of a metaheuristic tasked with selecting network parameters.
The application of the introduced approach to two real-world datasets to determine their potential for real-world use.
The interpretation of the best generated RNN models that can be used as a valuable tools for renewable energy specialists to determine which factor has the most influence on the RES performance.
The structure of the article includes "Background and Preliminaries" for providing the technological fundamentals for the performed experiments."Proposed Method" explains the original version of the applied metaheuristic as well as the modified version."Dataset Description and Experiments" explains the utilized datasets in detail and gives information on the test setup.The outcomes are presented in "Results and Comparison", followed by a discussion.statistical validation and model interpretation presented in "Discussion, Statistical Validation and Interpretation".Finally, "Conclusions" concluded the work and presents potential future research.

BACKGROUND AND PRELIMINARIES
This section introduces techniques required for the reader to have a full and insightful understanding of experiments conducted in this research.

Time-series decomposition and integration
Time-series decomposition is a technique used to break down a time-series data into its constituent components, such as trend, seasonality, and residual (noise).By decomposing a time-series, we can better understand the underlying patterns and relationships within the data, which can, in turn result in improvements of reliability and accuracy of the timeseries forecasting, models like the Luong attention-based RNN model.

Decomposition techniques
Various decomposition techniques can be applied to time-series data, including: 1. Classical decomposition: This method decomposes a time-series into its trend, seasonal, and residual components using moving averages and seasonal adjustments.There are two primary approaches in classical decomposition: additive and multiplicative.In the additive decomposition, the time-series is expressed as the sum of its components, while in the multiplicative decomposition, the time-series is expressed as the product of its components.
2. Seasonal and trend decomposition using Loess (STL): STL is a flexible and robust decomposition method that uses locally weighted regression (Loess) to estimate the trend and seasonal components of a time-series.It can handle both constant and time-varying seasonality, as well as arbitrary patterns of missing data.The STL method also allows for user-defined control over the smoothness and periodicity of the seasonal and trend components.
3. Seasonal decomposition of time series (SDTS): SDTS is an extension of the classical decomposition method that incorporates a seasonal adjustment factor for each observation in the time-series.This factor is obtained by dividing the observed value by the corresponding seasonal component.The seasonal adjustment factors can be used to deseasonalize the time-series, which can then be analyzed for trend and residual components.
4. Wavelet transform: Wavelet transform is a mathematical technique used to decompose a time-series into a set of wavelet coefficients, which represent the time-series at different scales and resolutions.Wavelet transform can capture both the low-frequency (trend) and high-frequency (seasonal and noise) components of a time-series, making it a powerful tool for time-series decomposition and analysis.
5. Empirical mode decomposition: EMD is a powerful and flexible technique for analyzing non-stationary and non-linear time series data.Introduced by Huang et al. (1998), EMD is designed to adaptively decompose a time series into a finite set of intrinsic mode functions (IMFs) that capture the local oscillatory behavior of the signal at various scales.The primary goal of EMD is to provide a data-driven decomposition that does not rely on any predefined basis functions or assumptions about the underlying signal characteristics (Abayomi-Alli et al., 2020).By incorporating EMD into the renewable power generation forecasting process, we can potentially enhance the accuracy, reliability, and interpretability of the forecasting models, ultimately aiding in the efficient management and planning of renewable energy resources.

Variational mode decomposition
The VMD (Dragomiretskiy & Zosso, 2013) technique used for signal decomposition builds upon the solid foundation established but other methods.However, VMD does so with a strong mathematical foundation compared to empirical techniques.Signal modes of varying frequencies are extracted from the original signal original signals by finding modes that are orthogonal to each other with localized frequency content.The decomposition is achieved through progressive optimization according to Eq. (1).
in which VðtÞ are signal modes, V 0 ðtÞ denotes the derivative of VðtÞ with respect to time.
Additionally the regularization parameter l balances between extracted mode smoothness and sparsity.Accordingly, function UðVðtÞÞ promotes sparsity.The decomposition process is handled by an algorithm that switches between solving modes and determines the penalty.Minimizing the energy function modes can be determined with respect to VðtÞ.A Lagrange multiplier aðtÞ is also introduced giving Eq. (2).
where the k-th mode of a signal is represented by V k ðtÞ.In order to revise the penalty function, the energy function is minimized with respect to aðtÞ.To accomplish this, the derivative of EðVÞ with respect to aðtÞ is set to zero.The resulting function is shown in Eq. ( 3) with the k constraint defining the overall mode energy.

Integration of decomposed components
Once the time-series has been decomposed into its constituent components, the next step is to integrate these components into the forecasting model.There are several ways to incorporate the decomposed components into the Luong attention-based RNN model: 1. Component-wise modeling: Train separate RNN models for each of the decomposed components (trend, seasonal, and residual), and then combine the forecasts from these models to obtain the final forecast for the original time-series.This approach can help in capturing the unique patterns and dependencies within each component more effectively.
2. Feature augmentation: Use the decomposed components as additional input features to the RNN model, along with the original time-series.This approach can help the model in learning the relationships between the decomposed components and the target variable, potentially improving the model's forecasting performance.
3. Preprocessing: Deseasonalize the time-series by removing the seasonal component before training the RNN model, and then add back the seasonal component to the model's forecasts to obtain the final forecast for the original time-series.This approach can help in reducing the complexity of the time-series and make it easier for the model to capture the underlying trend and residual patterns.
4. Postprocessing: Train the RNN model on the original time-series, and then adjust the model's forecasts using the decomposed components (e.g., by adding the seasonal component to the model's forecasts).This approach can help in correcting the model's forecasts for any systematic errors or biases related to the seasonal component.

Recurrent neural network
Time series prediction is the motivation for the improvements in artificial neural networks (ANN) (Pascanu, Mikolov & Bengio, 2013).The difference from the multilayer perceptron is that the hidden unit links are enabled with a delay.The results of such modifications allow the model to be sensitive toward temporal data occurrences of greater length.
RNNs are considered as a high-performing solution but further improvements were applied to achieve even greater performance.The main issues are the exploding and vanishing gradient.The solution was provided with LSTM model.The reason for not using the latest solution is that sometimes RNNs tend to outperform LSTMs as they introduce a large number of hyperparameters that can sometimes hinder performance (Bas, Egrioglu & Kolemen, 2021).
The advantage of the RNN as well is that it does not have to take inputs of fixed vector length, in which case the output has to be fixed as well.While working with rich structures and sequences this advantage can be exploited.In other words, the model works with input vectors and is able to generate sequences on the output.The RNN processes the data of the sequence while the hidden state is held.

Luong attention-based model
The attention phenomenon is not defined by mathematics and its application in the Luong attention-based model should be considered as a mechanism (Luong, Pham & Manning, 2015;Raffel et al., 2017;Harvat & Martín-Guerrero, 2022).Some examples of different mathematical expression applications of the attention mechanism are the sliding window methods, saliency detection, local image features, etc. Regarding the attention mechanism application in the case of an RNN, the definition is precise.
The networks that can work with the attention mechanism and possess RNN characteristics are considered attention-based.The purpose of such a mechanism is to work with different weights for the sequence in input.The data can be captured as a result and input-output relations are usable.The basic solution of such architecture is the application of a second RNN.
The authors chose the Luong attention-based model for that purpose.Weight represented as w t is calculated for the source for every timestep t for the decoding of attention-based encoder-decoder as AE s w t ðsÞ ¼ 1 and 8s w t ðsÞ !0. The hidden state h t has a function that is the related timestep's predicted token, while the AE s w t ðsÞ Ã ĥs .
Different mathematical applications of the attention mechanism differ in the way they compute weights.In the case of the Luong model, it is the softmax function on the scaled scores of each token.Matrix W a linearly transforms the decoder's h t dot product and the encoder ĥs to calculate the score.

Hyperparameters of luong-attention based RNN
The Luong attention-based RNN model is an extension of the basic RNN model with the addition of an attention mechanism allows for selective focus on particular parts of the input sequence upon output generation.The following hyperparameters are typically involved in the configuration of the Luong attention-based RNN model: 1. Number of hidden layers (n hid ): The number of hidden layers in the RNN architecture, which determines the depth of the model.More hidden layers can enable the model to capture patterns of higher complexity and data dependencies but with the risk of overfitting and requiring more computational resources.
2. Number of hidden units per layer (n unit ): The number of hidden units (neurons) in each hidden layer of the RNN.A larger number of hidden units can increase the model's capacity to learn complex patterns, but it may also increase the risk of overfitting and require more computational resources.
3. Type of RNN cell: The choice of RNN cell used in the model, such as LSTM or GRU.These cells are designed to better handle long-range dependencies and mitigate the vanishing gradient problem compared to the traditional RNN cells.
4. Attention mechanism: The specific attention mechanism used in the model.In the case of the Luong attention-based RNN model, the attention mechanism can be of two types: global or local attention.Global attention attends to all the source positions, while attention is focused localy only on a small window of source positions around the current target position.

Attention scoring function:
The scoring function computes the alignment scores between the source and target sequences in the attention mechanism.Luong, Pham & Manning (2015) proposed three different scoring functions: dot product, general (multiplicative), and concatenation (additive).The choice of scoring function can affect the model's performance and interpretability.
6. Learning rate (a): The learning rate is a critical hyperparameter in control of the size of updates to the model's weights during the training process.A smaller learning rate might lead to more precise convergence but require more training iterations, while a larger learning rate may speed up the training process but risk overshooting the optimal solution.
7. Dropout rate (p drop ): The dropout rate is a technique of regularization used to prevent overfitting in neural networks.During training, a fraction of the neurons in the network is randomly "dropped out" or deactivated, with the specified dropout rate determining the proportion of neurons deactivated at each training iteration.
8. Batch size: The number of training samples used in a single update of the model's weights.A larger batch size can lead to more accurate gradient estimates and faster training but may require more memory and computational resources.
9. Sequence length: The length of input and output sequences used in the model.Longer sequences may allow the model to capture more extensive temporal dependencies but can also increase the computational complexity and risk of overfitting.

Metaheuristic optimization
In recent years model optimization has become a popular topic in computer science.Increasing model complexity, as well as growing numbers of hyperparameters of modern algorithms, has made it necessary to develop techniques to automate this process, which was traditionally handled through trial and error.However, this is a challenging task, as selecting optimal parameters is often a mixed NP-hard problem, with both discrete and continuous values having a role to play in defining model performance.A powerful group of algorithms capable of addressing NP-hard problems within reasonable time constraints and with realistic computational demands are metaheuristic optimization algorithms.By formulating the process of parameter selection as an optimization task, metaheuristics can be employed to efficiently improve performance.A notably popular group of metaheuristics is swarm intelligence that models observed behaviors of cooperating groups to perform optimizations.Some notable algorithms that have become popular for tacking optimization tasks among researchers include the HHO (Heidari et al., 2019), genetic algorithm (GA) (Mirjalili & Mirjalili, 2019), particle swarm optimizer (PSO) (Kennedy & Eberhart, 1995), artificial bee colony (ABC) (Karaboga, 2010) algorithm, firefly algorithm (FA) (Yang & Slowik, 2020).Additionally the LSHADE for Constrained Optimization with Levy Flights (COLSHADE) algorithm (Gurrola-Ramos, Hernàndez-Aguirre & Dalmau-Cedeño, 2020) and Self-Adapting Spherical Search (SASS) (Zhao et al., 2022) are notable recent examples of optimizers.These algorithms, and algorithms derived from their base have been applied in several fields with promising outcomes.Some noteworthy examples of metaheuristics applied to optimization problems include examples for crude oil price forecasting (Jovanovic et al., 2022;Al-Qaness et al., 2022), Ethereum and Bitcoin prices predictions (Stankovic et al., 2022b;Milicevic et al., 2023;Petrovic et al., 2023;Gupta & Nalavade, 2022), industry 4.0 (Jovanovic et al., 2023b;Dobrojevic et al., 2023;Para, Del Ser & Nebro, 2022), medicine (Zivkovic et al., 2022a;Bezdan et al., 2022;Budimirovic et al., 2022;Stankovic et al., 2022a), security (Zivkovic et al., 2022b;Savanović et al., 2023;Jovanovic et al., 2023c;Zivkovic et al., 2022c), cloud computing (Thakur & Goraya, 2022;Mirmohseni, Tang & Javadpour, 2022;Bacanin et al., 2022d;Zivkovic et al., 2021), and environmental sciences (Jovanovic et al., 2023d;Bacanin et al., 2022b;Kiani et al., 2022).

PROPOSED METHOD
This section begins with a short overview of the basic HHO algorithm along the explanation and justifications of the modifications that were made to the original method.

Original Harris hawk optimization
The inspiration for the HHO are the attack strategies of the bird with the same name.The phases of attacks can be differentiated as exploration, the transition to exploitation, and the exploitation.The algorithm was introduced by Heidari et al. ( 2019) and has been used for a wide variety of optimization-related applications such as machine scheduling (Jouhari et al., 2020) and neural network optimization (Ali et al., 2022).
In the first phase, the exploration, the goal is the global optimum.Multiple locations in the population serve for random initialization which mimics the hawk's search for prey.The parameter q controls this process as it switches between two strategies of equal probability: Xðt þ 1Þ ¼ X rand ðtÞ À r 1 jX rand ðtÞ À 2r 2 XðtÞj; q !0 ðX best ðtÞ À X m ðtÞÞ À r 3 ðLB þ r 4 ðUB À LBÞÞ; q , 0:5; (4) in which the random number from the range ½0; 1 are r 1 , r 2 , r 3 , and r 4 as well as q and these numbers are updated on an iteration basis.The position vector of the solution in the next iteration is Xðt þ 1Þ, and the positions of the solutions of the best, current, and average solutions in the current iteration t are given respectively as X best ðtÞ, XðtÞ and X m ðtÞ, while the lower bound is LB and the upper bound is UB.The average position is provided by a simple averaging approach: for which N shows the total solutions number, and the individual X at iteration t is shown as X i ðtÞ.
The term prey energy is introduced as it indicates if the algorithm should revert back to exploration and so forth.The solutions updates strength in each iteration as: for T as iteration maximum for a run, the prey's initial energy E 0 which varies inside the ½À1; 1 interval.The exploitation phase represents the literal attack of the hawk and maps out its behavior as it is closing in.The mathematical translation is given as jEj !0:5 for more passive attacking, and jEj , 0:5 otherwise.
In cases where the prey of the hawk is still at large, the hawks encircle the prey with the goal of exhaustion which is modeled as follows: Xðt þ 1Þ ¼ DXðtÞ À EjJX best ðtÞ À XðtÞj (7) DXðtÞ ¼ X best ðtÞ À XðtÞ; (8) for which the vector difference of the best solution (prey) and the current solution in iteration t is shown as DXðtÞ.The strategy of the prey's escape is controlled by the random attribute J which differs from iteration to iteration: for which the interval ½0; 1Þ maps out the random value r 5 .For r !0:5 and jEj , 0:5 the prey is considered exhausted and more aggressive attack strategies are applied.The current position in this case is updated as: If the prey is still not giving up the hawks apply another attack strategy called zig-zag movements commonly known as leapfrog movements.Following equation evaluates if such behavior should be applied: Y ¼ X best ðtÞ À EjJX best ðtÞ À XðtÞj; (11) while the leapfrog movements are modeled as: in which the problem dimension is given as D, a random vector of 1 Â D size as S, and the levy fligth LF calculated by: Consequently, the position updating mechanism is provided: where the Eqs.( 11) and ( 12) are utilized for calculating the Y and Z. Lastly, for the case of r 0:5 and jEj , 0:5 the prey is considered to be out of energy, and stronger attacks are applied with rapid drive progressively.The distance between the target before its acquisition is modeled as: for which the Y and Z are obtained by the next two equations: Proposed enhanced Harris hawk optimization algorithm

New initialization scheme
The applied approach exploits a novel initialization strategy of populations: in which the j-th component of i-th solution is given as x i;j , the upper and lower bounds are represented by ub j and lb j for the parameter j, and a pseudo-random number is drawn between ½0; 1 and given as w.
The quasi-reflection-based learning (QRL) procedure has proven to give results (Jovanovic et al., 2023b) where applied with the goal of sarge space enlargement for the case of those generated by the (18).The purpose of the QRL procedure is reflected in the fact that if the observed solution falls in the suboptimal region of the search space, there is a fair chance that its opposite will fall in more promising areas of the search domain, as reported by several authors recently (Bacanin et al., 2023a;Basha et al., 2021;Nama, 2022;Çelik, 2023;Lei et al., 2022;Bacanin et al., 2021;Xue, 2022).Hence the x qr j , quasi-reflexiveopposite component for all parameters of a solution x j is provided as in the following equation: while at lb j þub j 2 ; x j h i interval a pseudo-random number is chosen as rnd.

Mechanism for maintaining population diversity
Diversification is observed as a parameter of the convergence/divergence ratio during the search process as in Cheng & Shi (2011).L1 norm (Cheng & Shi, 2011) applies two-component diversification for the solutions and the dimensions of the problem.Important information for the search process can be derived from the dimension-wise metric with the L1 norm.
The number of total individuals is marked with m and the dimensions number as n, the L1 norm is given as in Eqs. ( 20)-( 22): in which every individual's position mean is represented as " x vector over all dimensions, the hawk's position vector of diversity as L1 norm is shown as D p j , while the scalar form is shown as D p for the entire population.Using regular strategies of initialization usually results in higher diversity with weaker convergence towards later iterations.The described metric is used for L1 determination of the threshold D t for the diversity.Firstly, the D t0 is calculated by Eq. 23, which is followed by condition D P , D t for the satisfactory value of diversity, the worst solutions are replaced with randomly generated solutions nrs with the same strategy for population initialization.The nrs value is another control parameter.
The Eq. ( 1) and Algorithm 1 indicate close generation of solutions towards the bounds of the search space's mean.The value D t falls of as shown in: in which the current and subsequent iterations are given as iter and iter þ 1, and the number of iterations at the maximum is T. According to this mechanism, the D t falls off in no relation to the D P and still will not trigger the mechanism.

Inner workings and complexity of proposed method
Taking inspiration from applied mechanisms to the original solution the proposed new algorithm is diversity directed HHO (DDHHO), which is shown in Algorithm 2. It is important to note that the computational complexity of the original algorithm is not lower than that of the novel solution.In modern literature, it is a practice to measure this in FFEs as it is the most resource-demanding technique, hence the complexity of the DDHHO for for (each hawk (X i )) do Update the initial energy E 0 and jump strength J Update the E using Eq. ( 6) Update the location vector using Eq. ( 4) 5 and jEj !0:5) then Update the location vector using Eq. ( 7) else if (r !0.5 and jEj , 0:5) then Update the location vector using Eq. ( 10) else if (r , 0.5 and jEj !0:5) then Update the location vector using Eq. ( 14) else if (r , 0.5 and jEj , 0:5) then Update the location vector using Eq. ( 15) Hyperparameter optimization using HHO To optimize the hyperparameters of the Luong attention-based RNN model, we perform the following steps: Define the search space: Identify the hyperparameters to be optimized and specify their respective ranges or discrete sets of possible values.For instance, for the number of hidden layers, we may specify a range of values, e.g., from 1 to 5. Similarly, we define the search space for other hyperparameters such as the number of hidden units per layer, type of RNN cell, attention mechanism, attention scoring function, learning rate, dropout rate, batch size, and sequence length.
Initialize the population: Generate an initial population of candidate solutions, where each candidate solution represents a combination of hyperparameter values within the defined search space.
Evaluate candidate solutions: For each candidate solution, train the Luong attentionbased RNN model using the specified hyperparameter values, and evaluate the performance on a validation set using one or more performance metrics (e.g., MAE, RMSE, and MAPE).This step may require cross-validation or other validation techniques to obtain reliable performance estimates.
Apply optimization algorithm: Utilize the chosen metaheuristic optimization algorithm for search space exploration and find the best combination of hyperparameter values that minimizes the chosen performance metric(s).In each iteration, the algorithm updates the candidate solutions based on the optimization strategy specific to the chosen algorithm, and the performance of the updated solutions is re-evaluated on the validation set.
Termination condition: The optimization process is ongoing until a predefined termination condition is met, such as a maximum iteration number, a minimum performance improvement threshold, or a predefined computational budget.
Select the optimal solution: Once the termination condition is reached, select the candidate solution with the best performance on the validation set as the optimal combination of hyperparameter values for the Luong attention-based RNN model.
Final model training and evaluation: Train the Luong attention-based RNN model using the optimal hyperparameter values on the entire training set, and evaluate its performance on the test set to obtain an unbiased estimate of the model's forecasting accuracy.

DATASET DESCRIPTION AND EXPERIMENTS
This section aims to provide an overview of the datasets utilized in the experiments and the experimental setup established for all methods employed in the comparative analysis.

Spain solar energy dataset
The first dataset, concerning photovoltaic power generation in Spain, is constructed from real-world originating from two different sources.The ENTSO-E portal (https:// transparency.entsoe.eu/)provides hourly energy demand and generation considering the renewable energy in Spain, while the weather data is provided by OpenWeather API (https://openweathermap.org/guide) for the location of Valencia, Spain.
Considering the large amount of data available, a smaller dataset segment was utilized during experimentation.The datasets cover hourly data from 1.8.2018.to 31.12.2018.and covered a total of 3,670 data points.The hourly metrics that were the most relevant are included for multivariate forecasting as well as the data and support metrics of generated photovoltaic power.The dataset was then further separated and with 70% of the data used for training, 10% for validation, and the remaining 20% for testing.The included features include generated photovoltaic power, as well as humidity, rainfall, cloud cover, and ambient temperature.With the generated photovoltaic power feature being the prediction target.

China wind farm dataset
The Global Energy Forecasting Competition 2012 (GEFCom2012) is a competition that aimed to promote the development of state-of-the-art forecasting models for various aspects of the energy industry.The dataset related to wind farms in China used in a competition (https://www.kaggle.com/competitions/global-energy-forecastingcompetition-2012-load-forecasting/data).Seven wind farms from mainland China were selected and anonymized for this dataset.Power generation data has been normalized as well due to anonymity concerns.
Relevant wind data is collected every 12 h while the dataset includes forecasts in intervals of 24 h.The direction and speed of the wind and meridional wind components are provided as well.The dataset consists of hourly measurements of wind power generation from seven wind farms located in China, spanning from January 1, 2011, to September 30, 2012.Each wind farm has different installed capacities, which makes the forecasting task more challenging.For experimentation, hourly resolution data has been split into predictions of 12 h and then further combined with normalized real-world data of power generation for each farm by the hour.Due to the last year of data not being available, the dataset consists of four years of data.The included features are Wind speed, wind direction, and zonal and meridional wind components for each wind farm while the target feature is the amount of generated power.
The first 70% of the available data points were utilized for training, while the later 10% and 20% were used for validation and testing.

Data preprocessing
Before using the dataset for renewable power generation forecasting, some preprocessing steps may be necessary: 1. Missing data imputation: The dataset may contain values that are missing, which are required to be imputed before using the data for model training and evaluation.Various imputation techniques can be employed, such as linear interpolation or more advanced methods based on machine learning models.

Data splitting:
The division of the dataset into training, validation, and testing subsets.
The training and validation sets can be used for model development and hyperparameter tuning, while the testing set can be used for final performance evaluation of the model's forecasting.
3. Feature engineering: Extract additional features from the dataset that may be relevant for the forecasting task, such as lagged values of wind power, moving averages, or other temporal features that can help in pattern and depenendcy caputring in the data.4. Normalization/standardization: Scale the input features and target variable to ensure that they are on a similar scale, which is able of improving the performance and stability.
Once the dataset is preprocessed, it can be used to train and evaluate various forecasting models, such as the Luong attention-based RNN model discussed earlier.By incorporating techniques like time-series decomposition, attention mechanisms, and hyperparameter optimization, the forecasting models can be tailored to the specific characteristics and challenges of the wind power generation data, ultimately improving the accuracy and reliability of the forecasts.

Experimental setup
The following setup regards all four test cases that have been executed.Two stages are differentiated during experimentation.During the first, the data is decomposed for both test cases.Afterward, the signal components and residual signals are provided to the RNN for forecasting.VMD was employed for feature engineering, and min-max scaling was utilized as scaling option.Every tested model was provided in the same manner with historic data of six input points per model for three steps ahead predictions.
The data was split in the same manner for all four test cases, with the training set amounting to 70%, the validation set of 10%, and the testing set of 20%.The split of each the solar dataset target features is visualized with Fig. 1 to illustrate the time intervals that were employed in each of the three mentioned subsets.Similarly, the wind dataset is shown in Fig. 2.
The parameters for the VMD were empirically established and the parameter K ¼ 3, while the alpha parameter represents the length of the used dataframe.To ensure the objectivity of model evaluation 30 independent runs were performed due to the stochastic nature of the optimization algorithms.The selected parameters for optimization of the RNN are given in the following text due to their impact on the performance of the model.The ranges of the parameters alongside their descriptions are given: [50, 100] number of neurons, [0.0001, 0.01] learning rate, [100,300] training epochs, [0.05, 0.1] dropout rate, and ½1; 3 for the total layer number of a network.
Lastly, an early stopping mechanism is incorporated for overfitting prevention with the threshold empirically determined as epochs 3 .The purpose of such a mechanism is to terminate the model early if no improvements are observed for epochs 3 .It should be noted that computational resource waste is reduced as an effect of this approach.
This study employs five performance metrics commonly used to evaluate the accuracy and effectiveness of the proposed attention-based recurrent neural network (A-RNN) model for renewable power generation forecasting.These performance metrics are mean absolute error (MAE), root mean squared error (RMSE), mean absolute error (MAE), Coefficient of determination (R 2 ) and the index of alignment (IA).
MAE is the average of the absolute differences between the predicted values and the actual values.It measures the magnitude of errors in the forecasts without considering their direction.The MAE is defined as: for which the N represents data points total, y i the actual value, and ŷi the predicted value.RMSE is the square root of the average of the squared differences between the predicted values and the actual values.It provides a measure of the overall model's performance by penalizing larger errors more than smaller errors.The RMSE is defined as: MAE is the average of the absolute differences between the predicted values and the actual values.It can be useful for comparing the performance of different models across various scales.The MAE is defined as: where the || denotes the absolute value.R 2 indicates the proportion of the variance in the dependent variable that can be explained by the independent variables in the model.It ranges from 0 to 1, with higher values indicating a better fit between the model and the data.R 2 is defined as: where the " y refers to the mean of the actual values.IA measures the extent to which the model's predicted outcomes align with the true outcomes or the intended goals.A higher Alignment Index indicates a stronger alignment, suggesting that the model is performing well.AI is defined as: These performance metrics, MAE, RMSE, and MAPE, are used to evaluate the accuracy and effectiveness of the proposed A-RNN model in comparison to the regular RNN model for renewable power generation forecasting.A lower value for each metric indicates better forecasting performance.
A flowchart of the utilized experimental framework is provided in Fig. 3.

RESULTS AND COMPARISON
This section exhibits obtained experimental findings in terms of captured performance metrics.The best metrics in all tables were marked with bold style to more clearly visualize the best performing methods.

Spain solar energy forecasting
In Table 1 the objective function outcomes for the best, worst, mean, and median executions, alongside the standard deviance with variance are shown for 30 independent runs of each metaheuristic.As Table 1 suggests, the introduces algorithms attained the best results when optimizing a RNN in the best run.However, admirable stability was demonstrated by the PSO.Furthermore, when considering the worst case execution the ABC attained the best results as well as in the mean and median runs.This is to be expected as per the NFL (Wolpert & Macready, 1997) no single approach works equally well in all execution cases.Further detailed metrics for the best run, for each forecasting step and every tested metaheuristic are demonstrated in Table 2.
As it can be observed from Table 2 the introduced method attained the best overall results in all cases except the R 2 metric, where the PSO attained better results.As the guiding objective function during the optimization process was MSE this is to be expected.Additionally the introduced method also attained the best results when making forecasts two steps ahead, as well MAE for one step ahead.The best results for R 2 , MSE and IA where attained by the GA, while the best RMSE results where attained by the PSO.182,843.648288 203,363.603179 193,893.148713 209,089.779959 189,095.658509 203,815.817799 208,289.867841 205,318 Nevertheless when making forecasts three steps ahead the PSO attained the best results across all metrics except R 2 where the FA attained the best outcomes.
To help demonstrated the improvements made by the introduced method visualizations are provided for the distribution of both MSE and R 2 are shown in Fig. 4 followed by convergence plots for both functions in Fig. 5 and swarm and KDE plots in Fig. 6.
Finally, the parameters selected by each metaheuristic for their respective best models are shown in Table 3.
Similarly to the previous experiment, in Table 4 the objective function outcomes for the best, worst, mean, and median executions, alongside the standard deviance with variance are shown for 30 independent runs of each metaheuristic.
Interestingly, when optimizing the RNN-ATT models, the introduced metaheuristic demonstrated better performance overall most metrics.However, the ABC and SASS algorithms demonstrated a slightly higher degree of stability despite attaining less impressive results.
Further detailed metrics for the best run, for each forecasting step and every tested metaheuristic are demonstrated in Table 5.
As it can be observed in Table 5 the introduces method attained the best overall results for MSE and MAE, while the HHO attained the best IA results, the ABC attained the best R 2 outcomes overall, while SASS attained the best outcomes for MAE.The introduced approach demonstrated the best performance when making predictions one step ahead, while two step ahead forecasts are done best by the PSO.No single approach performed the best for three steps ahead, while different metaheuristics attaining first place in different metrics further enforcing the NFL (Wolpert & Macready, 1997) theorem.
Visualizations of objective function and R 2 distributions are shown in Fig. 7 followed by their respective convergence graphs in Fig. 8.The KDE and swarm plots are also provided in Fig. 9.The parameters selected by each competing metaheuristic for their respective bestperforming models are shown in Table 6.
In Table 7 the objective function outcomes for the best, worst, mean, and median executions, alongside the standard deviance with variance are shown for 30 independent runs of each metaheuristic forecasting wind power generation.

China wind farm forecasting
The introduced metaheuristic attained the best outcomes in the best, mean and median executions, with the ABC attained the best outcomes in the worst case executions.Furthermore, the highest stability was demonstrated by SASS.
Further detailed metrics for the best run, for each forecasting step and every tested metaheuristic are demonstrated in Table 8.
As demonstrated in Table 8, the introduced metaheursitic outperformed all competing metaheuristic in overall outcomes.THe introduces metaheuristic demonstrated the best results for one step ahead forecasts¿ However, the PSO attained the best results for two steps ahead forecasts, and COLSHADE attained the best outcomes for three steps ahead.These results further reinforce that no single approach is equally suited to all use-cases as   The network hyperparameters selected by each metaheuristic for the respective best performing models are shown in Table 9.
Table 5 The VMD-RNN-ATT solar energy metrics per each step.
Step  As it can be observed in Table 10 the introduced metaheuristic attained the best outcomes in all except the medial case, where the ABC algorithms attained the best results.Further detailed metrics for the best run, for each forecasting step and every tested metaheuristic are demonstrated in Table 11.As Table 11 demonstrates, the introduces algorithms performed admirably, attaining the best outcomes on overall evaluations as well as two and three step ahead.The original HHO performed marginally better in one step ahead forecasts when considering at the MAE and IA metrics.Further distribution and convergence graphs for the objective function and R 2 are shown in Figs. 13 and 14.Accompanying KDE and swarm diversity plots are given in Fig. 15.
Finally, the selected parameter for the best performing models optimized by each metaheuristic are shown in Table 12.

DISCUSSION, STATISTICAL VALIDATION AND INTERPRETATION
This section presents a discussion of the advantages of the techniques employed in the conducted research, as well as the statistical analysis of the methods used for comparisons, and the interpretation of the best models generated for both datasets.
Benefits of using attention mechanism for renewable power generation forecasting The attention mechanism has emerged as a powerful tool in the field of machine learning, particularly for sequence-to-sequence learning problems like renewable power generation forecasting.By selectively focusing on different parts of the input sequence when generating the output, the attention mechanism can enhance the performance of forecasting models like the Luong attention-based RNN model.Below, we discuss the key benefits of using attention mechanisms for renewable power generation forecasting: 1. Improved long-term dependency handling: Renewable power generation data often exhibit long-term dependencies due to factors like seasonal patterns and weather trends.Traditional RNN models can struggle to capture these long-term dependencies effectively, leading to suboptimal forecasts.The mechanism of attention introduces different importance weights for seperate input sequence parts, enabling it to focus on the most relevant information for generating the output, thus better handling long-term dependencies.
2. Enhanced forecasting accuracy: The attention mechanism can lead to more accurate forecasts by enabling the model to focus on the most relevant parts of the input sequence when generating the output.This selective focus allows the model to capture the underlying patterns and relationships within the renewable power generation data more effectively, resulting in improved forecasting performance.
3. Interpretability: Attention mechanisms provide a level of interpretability to the model's predictions by highlighting which parts of the input sequence have the most significant impact on the output.This interpretability can be particularly valuable in renewable power generation forecasting, as it allows domain experts to gain insights into the factors influencing the model's forecasts and to validate the model's predictions based on their domain knowledge.
4. Robustness to noise and irrelevant information: Renewable power generation data can be subject to noise and irrelevant information (e.g., due to measurement errors or unrelated external factors).The attention mechanism can help in mitigating the impact of such disturbances on the model's forecasts by selectively focusing on the most relevant parts of the input sequence and down-weighting the influence of noise and irrelevant information.
5. Scalability: Attention mechanisms can scale well with large input sequences, as they allow the model to focus on the most relevant parts of the input sequence without the need to process the entire sequence in a fixed-size hidden state.This scalability can be particularly beneficial for renewable power generation forecasting problems, where the input data may consist of long sequences of historical power generation measurements and environmental variables.
6. Flexibility: Attention mechanisms can be easily incorporated into various RNN architectures, such as LSTM and GRU, providing flexibility in designing and adapting the forecasting model for different renewable power generation scenarios and data characteristics.
An additional note needs to be made on attention mechanisms.The attained results suggest that networks utilizing the attention mechanisms perform slightly worse then the basic RNN.This is likely due to networks with attention layers having a deeper network architecture and thus require more training epochs to improve performance.

Benefits of time series decomposition and integration
Incorporating time-series decomposition and integration into the Luong attention-based RNN model can offer several benefits for renewable power generation forecasting: 1. Improved forecasting accuracy: By decomposing the time-series and accounting for its components, the model can better capture the underlying patterns and dependencies in the data, potentially leading to more accurate and reliable forecasts.
2. Enhanced model interpretability: Decomposition provides insights into the different components of the time-series, making it easier to understand and interpret the model's predictions in terms of trend, seasonality, and residual components.
3. Robustness to noise: By separating the noise component from the trend and seasonal components, the decomposition process can help in reducing the impact of noise and outliers on the model's forecasts, making the model more robust to disturbances.
4. Flexibility and customizability: Decomposition and integration techniques can be adapted and fine-tuned to suit the specific characteristics and requirements of the renewable power generation data, allowing for a more flexible and customizable forecasting approach.
5. Improved model performance: The integration of decomposed components into the RNN model can help in better capturing the relationships between the components and the target variable, potentially leading to improved model performance in terms of generalization and predictive accuracy.

Statistical analysis
When considering optimization problems, assessing models is an important topic.Understanding the statistical significance of the introduced enhancements is crucial.Outcomes alone are not adequate to state that one algorithms is superior to another one.Previous research suggests (Derrac et al., 2011) that a statistical assessment should take place only after the methods being evaluated are adequately sampled.This is done by ascertaining objective averages over several independent runs.Additionally, samples need to originate form a normal distribution so as to avoid misleading conclusions.The use of objective function averages is still for comparison of stochastic methods is still an open question among researchers (Eftimov, Korošec & Seljak, 2017).To ascertain statistical significance of the observed outcomes the best values over 30 independent executions of each metaheuristic have been used for creating the samples.However, the safe use of parametric tests needed to be confirmed.For this, independence, normality, and homoscedasticity of the data variances were considered as recommended by LaTorre et al. (2021).The independence criterion is fulfilled due to the fact that each run is initialized with an pseudo-random number seed.However, the normality condition is not satisfied as the obtained samples do not stem from a normal distribution as shown by the KED plots and proved by the Shapiro-Wilk test outcomes for single-problem analysts (Shapiro & Francia, 1972).By performing the Shapiro-Wilk test, p-values are generated for each method-problem combination, and these outcomes are presented in Table 13.
The standard significance levels of a ¼ 0:05 and a ¼ 0:1 suggest that the null hypothesis (H0) can be refuted, which implies that none of the samples (for any problemmethod combinations) are drawn from a normal distribution.This indicates that the assumption of normality, which is necessary for the reliable use of parametric tests, was not satisfied, and therefore, it was deemed unnecessary to verify the homogeneity of variances.
As the requirements for the reliable application of parametric tests were not met, nonparametric tests were employed for the statistical analysis.Specifically, the Wilcoxon signed-rank test, which is a non-parametric statistical test (Taheri & Hesamian, 2013), was performed on the DDHHO method and all other techniques for all three problem instances (experiments).The same data samples used in the previous normality test (Shapiro-Wilk) were used for each method.The results of this analysis are presented in Table 14, where p-values greater than the significance level of a ¼ 0:05 are highlighted in bold.
Table 14, which presents the p-values obtained from the Wilcoxon signed-rank test, demonstrate that, except for the ABC algorithm in the experiment where VMD-RNN was optimized and validated against solar and wind datasets, the proposed DDHHO method achieved significantly better performance than all other techniques in all three experiments.When compared with ABC, the calculated p-value was slightly above the 0.05 threshold (highlighted in bold in Table 14), suggesting that the DDHHO performed comparably to ABC.This was expected for the solar dataset, since the ABC in this simulation achieved moderately better mean value than the DDHHO, as demonstrated in Table 1.
The p-values for all other methods were lower than 0.05.Therefore, the DDHHO technique exhibited both robustness and effectiveness as an optimizer in these computationally intensive simulations.Based on the statistical analysis, it can be concluded that the DDHHO method outperformed most of the other metaheuristics investigated in all four experiments.
Best model interpretation and feature importance SHAP (Lundberg & Lee, 2017) is a method that can be utilized to interpret the outputs of various AI models.Game theory provides a strong basis for SHAP.Though the use of SHAP the influence real-world factors have on model predictions can be determined.In order to determine the factors that play the highest role in energy production in solar and wind generation the best models with the highest performance output have been subjected to analysis.The outcomes for solar generation are shown in Fig. 16, while wind generation is shown in Fig. 17  As demonstrated by Fig. 16, a significant influence of previous solar generation instances can be observed.Cloud cover and humidity play a minor role in forecasting, with cloud cover decreasing the power produced by the photovoltaic cells.
Indicators form Fig. 17 suggest that when forecasting wind power generation wind direction modes have an important role.However, likely due to the sporadic nature of wind bursts wind generation residuals have the highest impact on predictions.Finally, the meridional followed by zonal wind components pay a minor role in forecasting.

CONCLUSIONS
This study presents a novel attention-based recurrent neural network model for multistep ahead time-series forecasting of renewable power generation, demonstrating improved forecasting accuracy on both Spain's wind and solar energy datasets and China's wind farm dataset.The HHO algorithm is employed for hyperparameter optimization, addressing the challenges posed by the large number of hyperparameters in RNN-type networks.The attention model applied in the second group of experiments provides a weighting system to the RNN, further enhancing the model's performance.The proposed approach has the potential to significantly impact the transition towards a more sustainable future by addressing key challenges related to the storage and management of renewable power generation.
As with any work this research has several limitations.Other methods exist for tackling time-series forecasting and their potential remains yet to be explored.Further potential for improvement exist for the HHO, as well as other metaheuristic algorithms yet to be applied to cloud forecasting.Additionally, other approaches for interpreting feature influence exist such as through the analysis of n-Shapley values.
Future research will focus on refining the HHO algorithm for hyperparameter optimization and exploring additional decomposition methods to further improve the forecasting capabilities of the proposed approach, as well as exploring additional metaheuristics applied to clout load forecasting.Additionally, further methods for feature impact interpretation will be explored.
4: Fitness calculation of every solution in P 5: P sorted by fitness Damaševičius et al. (2024), PeerJ Comput.Sci., DOI 10.7717/peerj-cs.179512/44theworstscenario is Yang & He (2013): OðDDHHOÞ¼ OðNÞ þ OðT Á N 2 Þ.In comparison to other metaheuristics algorithms, the complexity of the DDHHO is similar.For instance, firefly algorithm (Yang & Slowik, 2020) is more complex as it evaluates at most N Ã N solutions in each iteration.Algorithm 2 Pseudo-code of the basic HHO algorithm implementation.Inputs: The population size N and maximum number of iterations T Outputs: The location of the rabbit and its fitness value Initialize the random population X i ði ¼ 1; 2; . . .; NÞ

Table 1
VMD-RNN solar energy forecasting objective function overall outcomes.Table2The VMD-RNN solar energy metrics per each step.

Table 3
Parameters for best performing solar prediction RNN model optimized by each metaheuristic.

Table 4
VMD-RNN-ATT solar energy forecasting objective function overall outcomes.

Table 6
Parameters for best performing solar prediction RNN-ATT model optimized by each metaheuristic.

Table 7
VMD-RNN wind energy forecasting objective function overall outcomes.

Table 8
The VMD-RNN wind energy metrics per each step.

Table 9
Parameters for best performing wind prediction RNN model optimized by each metaheuristic.
Table 10 VMD-RNN-ATT wind energy forecasting objective function overall outcomes.

Table 11 The
VMD-RNN-ATT wind energy metrics per each step.

Table 12
Parameters for best-performing wind prediction RNN-ATT model optimized by each metaheuristic.